"Predicting Professor Paychecks"¶

Amelia Hsu, Jason Liu, and Brian Xiang¶


Table of Contents¶

  1. Motivation
  2. Hypothesis
  3. Data Collection/Curation
    1. Diamondback Salary Guide
    2. Rate My Professors
    3. PlanetTerp
  4. Data Munging
    1. Tidying the Data
      1. Diamondback Salary Guide
      2. Rate My Professors
      3. PlanetTerp
    2. Name Matching
    3. Creating Reviews Dataset
  5. Data Representation
  6. Exploratory Data Analysis
    1. Initial Graphing
    2. Graphing by Department
    3. Changes Over Time
  7. Review Content Analysis
    1. Some Fun with Word Clouds
    2. Sentiment Analysis
  8. Hypothesis Testing
  9. Insights and Conclusions

Motivation¶

Let's be honest — professors vary in teaching style (and quality). Students tend to be extremely vocal about their opinions of professors, and everyone is looking out for one another: friends want each other to get the best professors they can and avoid those they may not learn as well from. As a result, online platforms have been created to house student reviews of professors, the best known being Rate My Professors, which has data on over 1.3 million professors, 7,000 schools, and 15 million ratings. Three students at the University of Maryland, College Park even took the initiative to create their own platform to gather ratings specifically for UMD professors: PlanetTerp, which includes over 11,000 professors and 16,000 reviews. PlanetTerp has the additional feature of including course grades for each UMD course; as of right now, there are nearly 300,000 course grades stored on the site.

Starting in 2013, The Diamondback, UMD's premier independent student newspaper, began publishing an annual salary guide: The Diamondback Salary Guide. The Diamondback Salary Guide displays every university employee's yearly pay in an easily digestible format for all to view. This information is public data provided to The Diamondback by the university itself; The Diamondback simply made this data more accessible to all by posting it on a single website.

The Diamondback Salary Guide states, "[w]e won't tell you what conclusions to draw from these numbers, but it's our hope that they'll give you the information you need to reflect on your own." In this final tutorial, we plan to do just that: compare both the salaries and ratings of UMD professors and reflect on our findings. From our own past experiences, we have observed that our favorite professors are not always the ones being paid the highest salaries. We are interested in the possibility of a correlation between these two attributes. If there is a correlation between professor salary and rating, what is it? If a correlation exists, can we use this information to predict professor salary based on student reviews and vice versa?

Hypothesis¶

We hypothesize that there will be a negative correlation between a professor's rating and their yearly salary. In other words, as a professor's rating increases, their yearly salary should decrease. We predict this might be the case because tenured professors often retain higher annual salaries, even if their teaching quality declines over time.

Data Collection/Curation¶

In order to observe the relationship between professor salary and rating, we collected data from three sources: Diamondback Salary Guide (DBK), Rate My Professors (RMP), and PlanetTerp (PT). DBK was our source of professor salary data, and a combination of RMP and PT was used as our source of professor rating data.

Diamondback Salary Guide¶

The Diamondback Salary Guide has an undocumented API. However, we were able to learn how the API works by watching the network requests as we modified parameters on the site, which meant we could programmatically page through all of the results and pull full data for every year the Salary Guide tracks (2013-2022).

Our scraping code for DBK data can be found in src/scrape_salaries.py.
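The page-by-page pull can be sketched as follows. This is a minimal illustration of the pagination loop, not our actual scraper; the parameter names (`page`, `per_page`) are assumptions, not necessarily what the Salary Guide API calls them:

```python
def fetch_all_pages(fetch_page, page_size=100):
    """Collect every record from a paginated endpoint by advancing
    the page number until a short (or empty) page comes back."""
    records, page = [], 1
    while True:
        batch = fetch_page(page=page, per_page=page_size)
        records.extend(batch)
        if len(batch) < page_size:  # last page reached
            return records
        page += 1

# In practice, fetch_page would wrap an HTTP GET against the Salary
# Guide's endpoint (e.g. requests.get(url, params=...).json()).
```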

Rate My Professors¶

Rate My Professors also has an undocumented API. We discovered it by inspecting the network requests as we loaded a page of professor reviews: we noticed a GraphQL query, then copied over the query, the authentication header, and the variables we needed to emulate the request locally.

Interestingly, although their API technically requires authentication, the request from the website leaks the necessary Basic authentication header, which is test:test encoded in base64.
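That leaked header is just the standard HTTP Basic scheme; a quick sketch of reconstructing it (the GraphQL query itself is omitted here):

```python
import base64

# HTTP Basic auth encodes "username:password" in base64
creds = base64.b64encode(b"test:test").decode("ascii")
headers = {"Authorization": f"Basic {creds}"}

print(creds)  # dGVzdDp0ZXN0
```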

In addition, RMP performs fuzzy searching on their search endpoint, which lets us implement fuzzy name matching while we collect the RMP data. As we were gathering the data from RMP, we fed the RMP API each name in our DBK dataset. If RMP found a match for the name, it returned the appropriate professor rating information. This meant that each professor we had in our RMP dataset was automatically matched with a professor in our DBK dataset as we gathered the RMP professor data.

Our scraping code for RMP data can be found in src/scrape_rmp.py.

PlanetTerp¶

PlanetTerp was created by UMD students, and its creators were generous enough to document a public API to help fellow students use the data available on their website.

Using their /api/v1/professors/ endpoint, we collected a list of every professor PlanetTerp has data on who has taught at UMD (over 11,000!), along with the courses they've taught, their average rating across all of their courses, and all of their reviews, each of which includes the text content, rating, expected grade, and creation time.

Our scraping code for PlanetTerp data can be found in src/scrape_pt.py.

Data Munging¶

First, let's import our scraped data. We saved data collected from each of our three sources in their own separate CSV files.

In [118]:
import pandas as pd
import numpy as np

salaries_df = pd.read_csv("./src/data/salaries.csv")
pt_df = pd.read_csv("./src/data/pt_ratings.csv")
rmp_df = pd.read_csv("./src/data/rmp_ratings.csv")

Tidying the Data¶

Before we began doing anything with our data, we first needed to clean it up.

Diamondback Salary Guide¶

The DBK salary guide formats names differently from PlanetTerp and RMP, and contains escaped newlines returned from the API requests. As such, we decided to rearrange the first and last names in order to help our fuzzy search algorithm, and replaced newlines with spaces throughout the dataset. We also converted the salary strings to floats and extracted the school from the department strings.

In [119]:
# Rearrange first and last names to standardize name search in salaries
salaries_df["name"] = salaries_df["employee"].apply(
    lambda x: " ".join(x.split(", ")[::-1])
)

# Replace newlines with spaces
salaries_df["department"] = salaries_df["department"].str.replace("\n", " ").str.strip()

# Convert salaries to floats
salaries_df["salary"] = (
    salaries_df["salary"].replace(r"[\$,]", "", regex=True).astype(float)
)

# Extract school from department data
salaries_df["school"] = salaries_df["department"].str.split("-").str[0]
salaries_df.loc[~salaries_df["department"].str.contains("-"), "school"] = np.nan

Rate My Professors¶

While collecting data from RMP, we noticed something odd about each professor’s Overall Quality score: the results we calculated when averaging a professor’s individual quality ratings were not equal to the professor's Overall Quality score. When students create new reviews on RMP, they are asked to score the professor’s helpfulness and clarity, and we can see each review’s helpfulRating and clarityRating in the API data which we collected. However, the RMP website only displays a “Quality” score. In the vast majority of cases, we found that the Quality score is calculated by averaging the Helpful and Clarity scores ((helpfulRating + clarityRating) / 2). However, after performing a few calculations by hand, we found that a professor’s Overall Quality is not the result of averaging each review’s Quality score.

Let’s take Clyde Kruskal as an example: at the time of our calculations, RMP gave Kruskal an Overall Quality score of 2.30. However, the average of each review’s Quality was 2.14, the average of each review’s helpfulRating was 2.11, and the average of each review’s clarityRating was 2.16, none of which equal 2.30. It is unclear what causes this discrepancy. Is RMP factoring in the professors’ difficulty ratings? How recent each review is? The Overall Quality score is a black box to us.

Since we do not know how RMP is calculating this score, we chose to average each review’s quality rating and use this value for the average rating, since we know exactly how this is calculated.

In [120]:
rmp_df = rmp_df[rmp_df["reviews"] != "[]"]

# Snippet from src/scrape_rmp.py
def calculate_ratings(names, rmp_get_ratings):
    df = pd.DataFrame(columns=["name", "rating", "courses", "reviews"])

    for i, name in enumerate(names):
        # handle cases where someone has middle name(s)
        splitted = name.split(" ")
        # if multiple middle names, only use first and last 
        if len(splitted) >= 3:
            name = splitted[0] + " " + splitted[-1]
        print(f"getting reviews for {name} {i}/{len(names)}")
        # unique set of courses taught by a professor
        courses = set()
        # call function to make request to api
        ratings = rmp_get_ratings(name)
        reviews = []
        score = 0
        # iterate over each review and get clarity/helpful ratings
        for rating in ratings:
            data = rating["node"]
            course = data["class"]
            courses.add(course)
            score += data["clarityRating"]
            score += data["helpfulRating"]

            # create our review object
            reviews.append(
                {
                    "professor": name,
                    "course": course,
                    "review": data["comment"],
                    "rating": data["clarityRating"],
                    "expected_grade": data["grade"],
                    "created": data["date"],
                }
            )

        if len(ratings) != 0:
            # since we add both ratings, divide by 2
            score /= len(ratings) * 2
        else:
            score = 0

        # append to dataframe
        df.loc[len(df)] = [name, score, list(courses), reviews]

    return df

PlanetTerp¶

PlanetTerp has many listings for professors that have zero ratings, which is not helpful in our data exploration. For this reason, we removed all professors from our PT dataset who had no reviews. We also noticed that it was possible for PT to have multiple listings for the same professor (for example, PT has two separate listings both named Madeleine Goh). These duplicate entries are eventually broken out into individual review rows, so we don't need to make any special exceptions for these professors.

In [121]:
# Drop professors without any reviews
pt_df = pt_df[pt_df["reviews"] != "[]"]

Name Matching¶

To connect a professor’s salary to their ratings, we needed to find a way to match the names from each dataset to each other. This proved to be a bit more difficult than we expected, because professor names were not standardized between the three platforms: sometimes names included a middle name, sometimes a middle initial, and sometimes no middle name at all. Occasionally, professor nicknames were listed instead of their full names. With over three thousand different professors, we could not possibly match professor names by hand, so we needed a method to find the best professor matches between the three datasets. We used fuzzy name matching to accomplish this task. Fuzzy matching (also known as approximate string matching) is a technique for identifying two strings that are approximately similar but not exactly the same.

We explored two different options for matching professor names from PlanetTerp to the Diamondback Salary Guide. One option we considered was Hello My Name Is (HMNI), which performs fuzzy name matching using machine learning. However, we decided against HMNI because it had not been updated in two years and had trouble running on our version of Python. The next method we tried was fuzzywuzzy or fuzzyset, which also perform fuzzy name matching, but use the Levenshtein distance to calculate similarities between names. The Levenshtein distance is the number of deletions, insertions, or substitutions required to transform one string into another. We ultimately decided to use fuzzyset to match professor names from PT to DBK because fuzzyset was faster than fuzzywuzzy, and we received more successful, correct matches than with HMNI.
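To make the metric concrete, here is a minimal sketch of the Levenshtein distance (our own illustrative implementation, not fuzzywuzzy's or fuzzyset's internals):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of deletions, insertions, or substitutions
    needed to turn string a into string b (two-row dynamic program)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = cur
    return prev[-1]

print(levenshtein("Jon Smith", "John Smith"))  # 1 (a single insertion)
```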

However, name matching did not end there. As previously mentioned, some professors had middle names or middle initials while others did not, which heavily impacted the Levenshtein distance calculations being performed by fuzzyset. After running a preliminary round of name matching, we noticed that many professors in PT were not being matched to the correct listing in DBK because of the presence/absence of middle names; this was especially the case for those with longer middle names. In order to resolve this issue, we ran two rounds of fuzzyset name matching. In the first round, we attempted to match the entire name, and in the second round we matched on only the first and last name. We only ran the second round of name matching for professors that were not matched accurately in the first round.
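The middle-name problem is easy to demonstrate. Below we use `difflib.SequenceMatcher` from the standard library as a stand-in similarity score (fuzzyset computes its own confidence, so the exact numbers differ), with hypothetical names:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical professor: one dataset lists the full legal name,
# the other lists only first + last.
full = similarity("John Smith", "John Alexander Smith")
first_last = similarity("John Smith", "John Smith")

print(f"with middle name:  {full:.2f}")        # well below a 0.75-style cutoff
print(f"first + last only: {first_last:.2f}")  # exact match
```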

With this addition, we were able to match over 500 previously unmatched professors whose data would have been thrown away without this change (the number of matched professors increased from 2,036 to 2,561). Both rounds considered a name match to be a string match with a confidence value of 0.75 or higher. While this does not 100% guarantee that every pair of names that we match are the same person, this method of name matching was the best that we could do considering that we did not have a validation set to check the accuracy of our fuzzy matching results and there were simply too many professor names for our group to check over every pairing by hand.

In [122]:
try:
    from cfuzzyset import cFuzzySet as FuzzySet
except ImportError:
    from fuzzyset import FuzzySet

# Return first and last name
def get_fl(s: str):
    sl = s.split()
    return f"{sl[0]} {sl[-1]}"

# Merge datasets fuzzily
def fuzzy_merge(
    d1: pd.DataFrame, d2: pd.DataFrame, fuzz_on="", alpha=0.75, beta=0.75, how="inner"
):
    d1_keys = d1[fuzz_on]
    d2_keys = d2[fuzz_on]

    # Create the corresponding fuzzy set for our keys
    # We pick the larger keyset to "fuzz" off of for performance and accuracy reasons
    fuzz_left = len(d2_keys.unique()) > len(d1_keys.unique())
    if fuzz_left:
        fuzz = FuzzySet(d2_keys.unique())
        fuzz_fl = FuzzySet(d2_keys.apply(get_fl).unique())
    else:
        fuzz = FuzzySet(d1_keys.unique())
        fuzz_fl = FuzzySet(d1_keys.apply(get_fl).unique())

    # Row helper that grabs matching name from fuzzy set
    def fuzzy_match(row):
        key = row[fuzz_on]
        matches = fuzz.get(key)
        match_conf, match_name = matches[0]

        # Beta is our cutoff confidence for doing 2nd round matching w/o middle names
        if match_conf <= beta:
            matches = fuzz_fl.get(key)
            match_conf, match_name = matches[0]

        # Return match if confidence is >= alpha
        return match_name if match_conf >= alpha else None

    # Apply fuzzy match and merge datasets
    if fuzz_left:
        d1["_fuzz"] = d1.apply(fuzzy_match, axis=1)
        return pd.merge(d1, d2, left_on="_fuzz", right_on=fuzz_on, how=how).rename(
            columns={"_fuzz": fuzz_on}
        )
    else:
        d2["_fuzz"] = d2.apply(fuzzy_match, axis=1)
        return pd.merge(d1, d2, left_on=fuzz_on, right_on="_fuzz", how=how).rename(
            columns={"_fuzz": fuzz_on}
        )
In [123]:
# Merge PlanetTerp <-> DBK salaries
merge_pt = fuzzy_merge(pt_df, salaries_df, fuzz_on="name", how="inner")
merge_pt.head()
Out[123]:
courses average_rating type reviews name_x slug name year employee department division title salary name_y school
0 ['ENME674', 'ENMA300', 'ENME684', 'ENME489Z', ... 5.0 professor [{'professor': 'Abhijit Dasgupta', 'course': '... Abhijit Dasgupta dasgupta_abhijit Abhijit Dasgupta 2013 Dasgupta, Abhijit ENGR-Mechanical Engineering A. James Clark School of Engineering Prof 167138.22 Abhijit Dasgupta ENGR
1 ['ENME674', 'ENMA300', 'ENME684', 'ENME489Z', ... 5.0 professor [{'professor': 'Abhijit Dasgupta', 'course': '... Abhijit Dasgupta dasgupta_abhijit Abhijit Dasgupta 2014 Dasgupta, Abhijit ENGR-Mechanical Engineering A. James Clark School of Engineering Prof 183580.92 Abhijit Dasgupta ENGR
2 ['ENME674', 'ENMA300', 'ENME684', 'ENME489Z', ... 5.0 professor [{'professor': 'Abhijit Dasgupta', 'course': '... Abhijit Dasgupta dasgupta_abhijit Abhijit Dasgupta 2015 Dasgupta, Abhijit ENGR-Mechanical Engineering A. James Clark School of Engineering Prof 190895.40 Abhijit Dasgupta ENGR
3 ['ENME674', 'ENMA300', 'ENME684', 'ENME489Z', ... 5.0 professor [{'professor': 'Abhijit Dasgupta', 'course': '... Abhijit Dasgupta dasgupta_abhijit Abhijit Dasgupta 2016 Dasgupta, Abhijit ENGR-Mechanical Engineering A. James Clark School of Engineering Prof 190895.40 Abhijit Dasgupta ENGR
4 ['ENME674', 'ENMA300', 'ENME684', 'ENME489Z', ... 5.0 professor [{'professor': 'Abhijit Dasgupta', 'course': '... Abhijit Dasgupta dasgupta_abhijit Abhijit Dasgupta 2017 Dasgupta, Abhijit ENGR-Mechanical Engineering A. James Clark School of Engineering Prof 198038.26 Abhijit Dasgupta ENGR
In [124]:
# Merge rmp_df <-> DBK salaries
merge_rmp = fuzzy_merge(rmp_df, salaries_df, fuzz_on="name", how="inner")
merge_rmp.head()
Out[124]:
name_x rating courses reviews name year employee department division title salary name_y school
0 Pamela Abshire 3.333333 ['ENEE419A', 'ENEE408D'] [{'professor': 'Pamela Abshire', 'course': 'EN... Pamela A. Abshire 2013 Abshire, Pamela A. ENGR-Electrical & Computer Engineering A. James Clark School of Engineering Assoc Prof 82872.96 Pamela A. Abshire ENGR
1 Pamela Abshire 3.333333 ['ENEE419A', 'ENEE408D'] [{'professor': 'Pamela Abshire', 'course': 'EN... Pamela A. Abshire 2013 Abshire, Pamela A. ENGR-Institute for Systems Research A. James Clark School of Engineering Assoc Prof 55149.36 Pamela A. Abshire ENGR
2 Pamela Abshire 3.333333 ['ENEE419A', 'ENEE408D'] [{'professor': 'Pamela Abshire', 'course': 'EN... Pamela A. Abshire 2013 Abshire, Pamela A. UGST-Honors College Undergraduate Studies Lecturer 5000.00 Pamela A. Abshire UGST
3 Pamela Abshire 3.333333 ['ENEE419A', 'ENEE408D'] [{'professor': 'Pamela Abshire', 'course': 'EN... Pamela A. Abshire 2014 Abshire, Pamela A. ENGR-Electrical & Computer Engineering A. James Clark School of Engineering Assoc Prof 82427.95 Pamela A. Abshire ENGR
4 Pamela Abshire 3.333333 ['ENEE419A', 'ENEE408D'] [{'professor': 'Pamela Abshire', 'course': 'EN... Pamela A. Abshire 2014 Abshire, Pamela A. ENGR-Institute for Systems Research A. James Clark School of Engineering Assoc Prof 66496.05 Pamela A. Abshire ENGR

Data Representation¶

Creating Reviews Dataset¶

After individually merging the PT and RMP data with the DBK salaries, we extracted all the reviews from each professor across these two platforms and created a dataframe with a row for each review entry:

In [125]:
from ast import literal_eval
import os

# Cache the reviews, since we do some heavy GPU intensive sentiment analysis later
if not os.path.exists("./src/data/reviews.csv"):
    reviews_df = []

    # Combine our individually merged dfs and group by name
    for name, rows in pd.concat([merge_pt, merge_rmp]).groupby("name"):
        for rs in map(literal_eval, rows["reviews"].unique()):
            for r in rs:
                reviews_df.append({**r, "professor": name})

    reviews_df = pd.DataFrame(reviews_df)

    # Drop expected_grade because that doesn't exist for PT
    reviews_df = reviews_df.drop(columns=["expected_grade"])

    # Replace name column w/ professor, which is our DBK name
    reviews_df = reviews_df.rename(columns={"professor": "name"})

    # Fix datetimes
    reviews_df["created"] = pd.to_datetime(reviews_df["created"].str.replace("UTC", ""))

    # Get year of created
    reviews_df["year"] = pd.DatetimeIndex(reviews_df["created"]).year

    # NOTE: This is a placeholder for later num_reviews calculations -- it should be 1 for each row
    reviews_df["num_reviews"] = 1
    
else:
    reviews_df = pd.read_csv("./src/data/reviews.csv", lineterminator="\n", index_col=0)

reviews_df.head()
Out[125]:
name course review rating created year num_reviews
0 A W. Kruglanski PSYC489H DO NOT TAKE PSYC489H "Motivated Social Cogniti... 2 2015-09-07 18:44:00+00:00 2015 1
1 A.U. Shankar CMSC412 Lectures are pretty dry and difficult to follo... 3 2013-01-02 21:32:00+00:00 2013 1
2 A.U. Shankar CMSC412 Professor: He does have a stutter, but if you ... 3 2012-12-23 03:51:00+00:00 2012 1
3 A.U. Shankar CMSC412 This is a horrible class. The projects are imp... 1 2012-10-29 00:54:00+00:00 2012 1
4 A.U. Shankar CMSC412 I have a lot of respect for Dr. Shankar. He is... 5 2012-05-24 13:00:00+00:00 2012 1

Here's the resulting breakdown of our dataset:

| Column | Description |
|--------|----------------|
| name | Fuzzy matched DBK name of the professor |
| course | The course that the review was written for |
| review | Contents of the review |
| rating | Rating for the professor on a scale of 1-5 |
| created | Datetime for when the review was written |
| year | Year in which the review was written |
| num_reviews | Used later to count the number of reviews (1 for the moment) |

Exploratory Data Analysis¶

Initial Graphing¶

After matching DBK salaries to PT ratings, we created a preliminary graph to visualize the data that we had tirelessly toiled to collect, tidy, and match.

For this first graph, we plotted every professor who has at least one review in either Rate My Professors and PlanetTerp and at least one entry in the Diamondback Salary Guide. We plotted each professor using their average rating and their most recently posted salary.

In [126]:
import plotly.io as pio
import matplotlib.pyplot as plt

# Styles & use plotly as our backend
pd.options.plotting.backend = "plotly"
pio.templates.default = "plotly_dark"
plt.style.use("dark_background")
In [127]:
# Group by each professor and year -- then sum their salaries
# This makes it so that we have a single salary record for each name + year
salaries_df = salaries_df.groupby(["name", "year"], as_index=False).agg(
    {
        "school": "first",
        "salary": "sum",
    }
)
In [128]:
# Match each review with the corresponding salary record for that name + year
merged_all_years_all_reviews = reviews_df.merge(salaries_df, on=["name", "year"], how="left")
In [129]:
# Match all reviews with the latest name + year record
merged_last_year_all_reviews = (
    reviews_df.merge(salaries_df, on=["name", "year"], how="outer")
    .sort_values("year", ascending=False)
    .groupby("name", as_index=False)
    .agg(
        {
            "school": "first",
            "salary": "first",
            "name": "first",
            "year": "first",
            "num_reviews": "sum",
            "rating": "mean",
        }
    )
)

# Multi-use labels for all plots
labels = {
    "rating": "Average Rating (1 to 5)",
    "salary": "Salary (US Dollars)",
    "num_reviews": "Number of Reviews",
    "school": "School",
    "sentiment": "Sentiment (-1 to 1)",
}
In [130]:
# Plot all professors w/ their latest salary as the y value, and average rating as their x value
merged_last_year_all_reviews.plot(
    kind="scatter",
    x="rating",
    y="salary",
    hover_data=["name", "year", "num_reviews"],
    trendline="ols",
    trendline_color_override="orange",
    title="Average Rating vs. Most Recent Salary",
    labels=labels,
)

Looking at this preliminary graph, we noticed a large concentration of points on the lines x = 1.0, 2.0, 3.0, 4.0, and 5.0. These concentrations come from the large number of professors on PlanetTerp who have only a single review, so their average rating is exactly that review's integer score. After seeing this, we decided to filter out professors with very few reviews. This shrinks our dataset, but it also reduces the number of one-off extremely high/low reviews that might otherwise skew our data.

In [131]:
# Same as above, but we only take professors w/ >= 10 reviews
merged_last_year_all_reviews[merged_last_year_all_reviews["num_reviews"] >= 10].plot(
    kind="scatter",
    x="rating",
    y="salary",
    hover_data=["name", "year", "num_reviews"],
    trendline="ols",
    trendline_color_override="orange",
    title="Average Rating vs. Most Recent Salary (professors with at least 10 reviews)",
    labels=labels,
)

This looks much better. 👍👍👍

Graphing by Department¶

Using these datapoints, let's label each point by the school the professor is in.

In [132]:
# Same as above, but we color code by school
merged_last_year_all_reviews[merged_last_year_all_reviews["num_reviews"] >= 10].plot(
    kind="scatter",
    x="rating",
    y="salary",
    color="school",
    hover_data=["name", "year", "num_reviews", "school"],
    title="Average Rating vs. Most Recent Salary (colored by School)",
    labels=labels,
)

It seems like there is an overwhelming number of CMNS professors on this scatterplot. Out of curiosity, let's also take a look at how many reviews each professor gets in each department.

In [133]:
# Group by name and sum up the number of reviews
merged_all_years_all_reviews.groupby("name", as_index=False).agg(
    {
        "num_reviews": "sum",
        "school": "first",
        "salary": "first",
        "year": "first",
        "name": "first",
    }
).plot(kind="box", x="school", y="num_reviews", color="school", hover_data=["name"], labels=labels)
In [134]:
merged_all_years_all_reviews.groupby("school")["num_reviews"].sum().sort_values(
    ascending=False
).head(10)
Out[134]:
school
CMNS    5860
ARHU    2709
BSOS    2054
BMGT    1135
ENGR    1112
AGNR     541
SPHL     380
INFO     297
JOUR     214
UGST     160
Name: num_reviews, dtype: int64

CMNS professors have the most reviews per professor. This makes intuitive sense: CMNS students (more specifically CMSC students) are likely on the internet more often and are more technologically savvy, and thus review more of their professors. There could also simply be more CMNS reviews because more CMNS professors exist compared to other departments; however, that explanation is incomplete, since ARHU has the second-most professors but not the second-most reviews per professor.

Next, let's graph the professors for each department on a separate graph to see if the ratings vs. salaries for each department follow a similar trend.

In [135]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import statsmodels.formula.api as smf

temp = merged_last_year_all_reviews[merged_last_year_all_reviews["num_reviews"] >= 10]

# Create subplots
fig = make_subplots(
    rows=(len(temp["school"].unique()) // 2),
    cols=2,
    subplot_titles=sorted(temp["school"].unique()),
)

# Create subplot for each school
for i, (school, school_df) in enumerate(temp.groupby("school")):
    # Drop rows w/o salaries
    school_df = school_df.dropna(subset=["salary"])

    # Skip if school has no datapoints
    if len(school_df) == 0:
        continue

    row = (i // 2) + 1
    col = (i % 2) + 1

    # Plot rating vs. salary
    fig.add_trace(
        go.Scatter(x=school_df["rating"], y=school_df["salary"], mode="markers"),
        row=row,
        col=col,
    )

    # Plot regline
    model = smf.ols(formula="salary ~ rating", data=school_df).fit()
    y_hat = model.predict(school_df["rating"])
    fig.add_trace(
        go.Scatter(
            x=school_df["rating"], y=y_hat, mode="lines", line=dict(color="#ffe476")
        ),
        row=row,
        col=col,
    )

    # Update axes
    fig.update_xaxes(title_text=labels["rating"], row=row, col=col)
    fig.update_yaxes(title_text=labels["salary"], row=row, col=col)

# Update size + title + disable legend
fig.update_layout(
    height=3000,
    width=1200,
    title_text="Rating vs. Salary Per Department",
    showlegend=False,
)

# Normalize our axes
fig.update_xaxes(range=[0.8, 5.2])
fig.update_yaxes(range=[0, temp["salary"].max()])

# Show figure
fig.show()

As we can see, each department is extremely different; there does not appear to be a shared trend whatsoever. However, some departments have significantly fewer data points than others (some do not even have the two or more points needed to fit a linear regression).

More analysis needs to be done. Currently, we are plotting any professor that has at least one rating and at least one salary. However, this does not account for salaries that have changed over time due to changes in position, inflation, pay raises, etc. Before we made any assumptions about the general trend of the data, let's take time into account.

Changes Over Time¶

In [136]:
# Create subplot
fig = make_subplots(
    rows=5,
    cols=2,
    start_cell="top-left",
    subplot_titles=[f"Rating vs. Salary for {i}" for i in range(2013, 2023)],
)

# Create subplot for each year
i = 0
for year, year_df in merged_all_years_all_reviews.groupby("year"):
    # Calculate average rating and # of reviews for each professor
    year_df = year_df.groupby("name", as_index=False).agg(
        {
            "rating": "mean",
            "school": "first",
            "salary": "first",
            "year": "first",
            "num_reviews": "sum",
        }
    )

    # Filter by at least 3 reviews and drop professors w/o any salaries
    year_df = year_df[year_df["num_reviews"] >= 3]
    year_df = year_df.dropna(subset=["salary"])

    # Skip empty years
    if len(year_df) == 0:
        continue

    row = (i // 2) + 1
    col = (i % 2) + 1

    # Plot rating vs. salary
    fig.add_trace(
        go.Scatter(x=year_df["rating"], y=year_df["salary"], mode="markers"),
        row=row,
        col=col,
    )

    # Plot regline
    model = smf.ols(formula="salary ~ rating", data=year_df).fit()

    # Output coefficients (index by name to avoid positional-indexing deprecation)
    print(str(year) + " Slope: " + str(model.params["rating"]))

    y_hat = model.predict(year_df["rating"])
    fig.add_trace(
        go.Scatter(
            x=year_df["rating"], y=y_hat, mode="lines", line=dict(color="#ffe476")
        ),
        row=row,
        col=col,
    )

    # Update labels
    fig.update_xaxes(title_text=labels["rating"], row=row, col=col)
    fig.update_yaxes(title_text=labels["salary"], row=row, col=col)

    i += 1

# Update size + title + disable legend
fig.update_layout(
    height=1500, width=1200, title_text="Rating vs. Salary Per Year", showlegend=False
)

# Normalize x axis
fig.update_xaxes(range=[0.8, 5.2])
fig.show()
2013 Slope: -13864.290727440639
2014 Slope: -2577.7150769865366
2015 Slope: 1849.6050756652894
2016 Slope: -4653.105928150357
2017 Slope: -8476.70450319327
2018 Slope: -2126.2029707391066
2019 Slope: -1068.2108372486932
2020 Slope: -7154.679293679653
2021 Slope: -7917.6360369754475
2022 Slope: -4878.901763837461

Now that we've separated our data by year, there appears to be a slight negative correlation between average professor rating and salary. The only exception to this trend is 2015, where the slope of the regression line is positive, though very shallow.

The steepest negative slopes occur in 2013, 2017, 2020, and 2021. While the circumstances of 2013 and 2017 are not clear, students were likely dissatisfied with the quality of teaching during the height of the COVID-19 pandemic (2020 and 2021), as professors were adjusting to new methods of online teaching, producing many extremely positive and extremely negative student reviews. These extreme reviews would in turn steepen the linear regression line.

Review Content Analysis¶

Some Fun with Word Clouds¶

The first thing we learned in this class was that creating word clouds impresses anyone. For this reason, we obviously had to make a word cloud of the most common words used in professor reviews. We tried to remove the most common school/college-related words: class, course, lecture, lectures, professor, student, students, exam, exams, test, and tests.

In [137]:
from wordcloud import WordCloud, STOPWORDS

# Shared config for each wordcloud
kwargs = {
    "background_color": "black",
    "max_font_size": 40,
    "scale": 3,
    "colormap": "Set2",
    "stopwords": set(
        [
            "class",
            "course",
            "lecture",
            "professor",
            "student",
            "students",
            "exam",
            "exams",
            "test",
            "tests",
            "lectures",
        ]
    )
    | STOPWORDS,
}

# Collect review words
words = merged_all_years_all_reviews["review"].str.cat(sep=" ")

# Generate wordcloud
wc = WordCloud(**kwargs).generate(words)

# Show figure
plt.figure(figsize=(15, 10))
plt.imshow(wc)
plt.axis("off")
plt.show()

Admittedly, this word cloud shows us nothing. HOWEVER, it looks pretty neat. Let's separate the "good" reviews from the "bad" reviews and see if the most common words differ drastically. We define a "good" review as a review with a rating above 3 stars, and a "bad" review as a review with a rating under 3 stars. Let's also make them into turtles to show our school pride.
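Note that these thresholds leave exactly-3-star reviews in neither bucket. As a small helper (hypothetical, not part of the notebook's code), the split can be written as:

```python
def bucket(rating):
    """Classify a review by star rating; exactly-3-star reviews fall in neither bucket."""
    if rating > 3:
        return "good"
    if rating < 3:
        return "bad"
    return "neutral"
```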

In [139]:
# GOOD REVIEWS
from PIL import Image

# Turtle :)
mask = np.array(Image.open("src/img/turtle.jpg"))

# Collect words (for reviews w/ rating above 3)
words = merged_all_years_all_reviews[merged_all_years_all_reviews["rating"] > 3][
    "review"
].str.cat(sep=" ")

# Generate wordcloud
wc = WordCloud(mask=mask, **kwargs).generate(words)

# Show figure
plt.figure(figsize=(15, 10))
plt.imshow(wc)
plt.axis("off")
plt.show()
In [140]:
# BAD REVIEWS

# Turtle :(
mask = np.array(Image.open("src/img/upsidedown_turtle.jpg"))

# Collect words (for reviews w/ rating below 3)
words = merged_all_years_all_reviews[merged_all_years_all_reviews["rating"] < 3][
    "review"
].str.cat(sep=" ")

# Generate wordcloud
wc = WordCloud(mask=mask, **kwargs).generate(words)

# Show figure
plt.figure(figsize=(15, 10))
plt.imshow(wc)
plt.axis("off")
plt.show()

Of the "good" and "bad" word turtles, the words that stood out the most to us were:

GOOD: easy, great, interesting, good, helpful, extra credit

BAD: difficult, worst, hard, boring, never, avoid, nothing
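Word clouds size words by raw frequency, so the same "standout words" judgment can be checked numerically with a simple counter. A minimal sketch (the stop list here is illustrative; the notebook's version also folds in wordcloud's `STOPWORDS`):

```python
from collections import Counter

# Illustrative stop list (the real one also includes wordcloud's STOPWORDS)
STOP = {"class", "course", "and", "the", "a"}

def top_words(reviews, n=5, stop=STOP):
    """Count the most common non-stopword tokens across a list of review strings."""
    tokens = (w for r in reviews for w in r.lower().split() if w not in stop)
    return Counter(tokens).most_common(n)
```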

Sentiment Analysis¶

We wanted a numeric metric that could help us better gauge the exact positivity/negativity of the review comments. This problem can be solved with sentiment analysis, which uses NLP and machine learning to tokenize a series of words and quantify them into a sentiment label and confidence score, summarizing the mood and tone of our reviews. We used transformers, an ML library provided by Hugging Face, to create a sentiment pipeline that generates a sentiment label and score for each of our reviews. The pipeline uses DistilBERT, a pre-trained NLP model that can be used to perform sentiment analysis.
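The pipeline returns dicts of the form `{"label": "NEGATIVE", "score": 0.99}`; collapsing each into a single signed value in [-1, 1], as the cell below does, can be sketched as a pure function:

```python
def signed_sentiment(result):
    """Collapse a {'label', 'score'} sentiment dict into one signed value in [-1, 1]:
    negative reviews map below zero, positive ones above."""
    sign = -1 if result["label"] == "NEGATIVE" else 1
    return sign * result["score"]
```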

In [142]:
# Cache our reviews
if not os.path.exists("./src/data/reviews.csv"):
    from transformers import pipeline
    from transformers.pipelines.pt_utils import KeyDataset

    # Create sentiment pipeline
    sentiment_pipeline = pipeline("sentiment-analysis", device=0)

    def get_sentiment(review):
        # Ignore empty reviews
        if not review:
            return None

        # Feed review into pipeline and extract sentiment
        # NOTE: DistilBERT truncates input to the first 512 tokens
        sentiment = sentiment_pipeline(review, truncation=True)[0]
        return (-1 if sentiment["label"] == "NEGATIVE" else 1) * sentiment["score"]

    # Apply & save to file
    reviews_df["sentiment"] = reviews_df["review"].apply(get_sentiment)
    reviews_df.to_csv("./src/data/reviews.csv")

reviews_df.head(10)
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
/home/cf12/miniconda3/envs/directml/lib/python3.10/site-packages/transformers/pipelines/base.py:1043: UserWarning:

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset

Out[142]:
name course review rating created year num_reviews sentiment
0 A W. Kruglanski PSYC489H DO NOT TAKE PSYC489H "Motivated Social Cogniti... 2 2015-09-07 18:44:00+00:00 2015 1 -0.998593
1 A.U. Shankar CMSC412 Lectures are pretty dry and difficult to follo... 3 2013-01-02 21:32:00+00:00 2013 1 -0.999379
2 A.U. Shankar CMSC412 Professor: He does have a stutter, but if you ... 3 2012-12-23 03:51:00+00:00 2012 1 -0.976684
3 A.U. Shankar CMSC412 This is a horrible class. The projects are imp... 1 2012-10-29 00:54:00+00:00 2012 1 -0.999185
4 A.U. Shankar CMSC412 I have a lot of respect for Dr. Shankar. He is... 5 2012-05-24 13:00:00+00:00 2012 1 0.996378
5 A.U. Shankar CMSC216 Stutters. Slow lectures. Exams are exactly the... 1 2017-11-19 22:26:47+00:00 2017 1 -0.999431
6 A.U. Shankar CMSC216 One of the worst lecturers I have had so far. ... 1 2018-01-23 22:50:53+00:00 2018 1 -0.999699
7 A.U. Shankar CMSC216 Shankar is a nice guy if you were to ever spea... 2 2018-04-11 00:07:08+00:00 2018 1 -0.954501
8 A.U. Shankar CMSC216 He is very nice if you talk to him, but if you... 1 2019-10-20 23:58:06+00:00 2019 1 -0.992549
9 A.U. Shankar CMSC216 Stutters, which makes his lectures impossible ... 1 2019-12-17 21:24:31+00:00 2019 1 -0.999610
In [143]:
# Merge salaries with reviews (that have sentiment now)
reviews_salaries_df = reviews_df.merge(
    salaries_df, on=["name", "year"], how="left"
).dropna(subset=["salary"])
In [144]:
import plotly.express as px

fig = px.scatter_3d(
    reviews_df.merge(salaries_df, on=["name", "year"], how="left"),
    x="rating",
    y="sentiment",
    z="salary",
    color="school",
    hover_data=["name"],
    labels=labels,
)
fig.show()

Hypothesis Testing¶

To put some statistical numbers behind our graphs and attempt to prove our hypothesis, we created linear regression models aiming to correlate rating and salary.

We first tested to see if there was a correlation between just professor rating and salary.

In [145]:
# Create linreg model
reg = smf.ols(formula="salary ~ rating", data=reviews_salaries_df).fit()
print(reg.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     39.79
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           2.91e-10
Time:                        03:48:41   Log-Likelihood:            -1.8619e+05
No. Observations:               15174   AIC:                         3.724e+05
Df Residuals:                   15172   BIC:                         3.724e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   1.029e+05   1020.900    100.841      0.000    1.01e+05    1.05e+05
rating     -1680.3686    266.388     -6.308      0.000   -2202.522   -1158.215
==============================================================================
Omnibus:                     2990.138   Durbin-Watson:                   0.447
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6903.113
Skew:                           1.119   Prob(JB):                         0.00
Kurtosis:                       5.431   Cond. No.                         9.87
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Our R-squared value for this linear regression was extremely low (0.003), suggesting that rating alone explains almost none of the variance in salary. Our next step was to create a linear regression that incorporated sentiment as an additional variable.
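For intuition, R-squared is the fraction of variance in the dependent variable explained by the fit: 1 - SS_res / SS_tot. A minimal stdlib sketch of the computation statsmodels reports:

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot: fraction of the variance in y explained by y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot
```

An R-squared of 0.003 means the model's predictions do barely better than always predicting the mean salary.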

In [146]:
# Create linreg model
reg = smf.ols(formula="salary ~ rating * sentiment", data=reviews_salaries_df).fit()
print(reg.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     18.45
Date:                Fri, 16 Dec 2022   Prob (F-statistic):           6.05e-12
Time:                        03:48:41   Log-Likelihood:            -1.8608e+05
No. Observations:               15166   AIC:                         3.722e+05
Df Residuals:                   15162   BIC:                         3.722e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         1.013e+05   1683.988     60.138      0.000     9.8e+04    1.05e+05
rating            -807.0259    428.650     -1.883      0.060   -1647.231      33.179
sentiment          858.1646   1702.689      0.504      0.614   -2479.312    4195.641
rating:sentiment  -876.1318    430.408     -2.036      0.042   -1719.784     -32.480
==============================================================================
Omnibus:                     2986.092   Durbin-Watson:                   0.448
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6888.173
Skew:                           1.118   Prob(JB):                         0.00
Kurtosis:                       5.428   Cond. No.                         27.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The R-squared value was still extremely low (0.004). From our exploratory graphs, we noticed that different schools followed different trends, as did different years. For this reason, we added variables for year and school to this last linear regression model.
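Adding `school` to the formula multiplies the parameter count because the formula interface expands a categorical column into one 0/1 indicator per level, measured against a reference level (hence the `school[T.<level>]` terms in the output below). A minimal sketch of that expansion, which the statsmodels/patsy machinery performs internally:

```python
def dummy_encode(values, drop_first=True):
    """Expand a categorical column into 0/1 indicator columns, dropping the first
    (reference) level -- mirroring formula-style school[T.<level>] treatment coding."""
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    return [[1 if v == lvl else 0 for lvl in kept] for v in values]
```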

In [147]:
# Create BETTER linreg model
reg = smf.ols(
    formula="salary ~ rating * sentiment * year * school", data=reviews_salaries_df
).fit()
print(reg.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.195
Model:                            OLS   Adj. R-squared:                  0.184
Method:                 Least Squares   F-statistic:                     19.14
Date:                Fri, 16 Dec 2022   Prob (F-statistic):               0.00
Time:                        03:48:44   Log-Likelihood:            -1.8447e+05
No. Observations:               15166   AIC:                         3.693e+05
Df Residuals:                   14976   BIC:                         3.708e+05
Df Model:                         189                                         
Covariance Type:            nonrobust                                         
==========================================================================================================
                                             coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------------
Intercept                              -1.585e+06   6.06e+06     -0.262      0.793   -1.35e+07    1.03e+07
school[T.ARCH]                          4.285e+06   2.38e+07      0.180      0.857   -4.24e+07     5.1e+07
school[T.ARHU]                           -5.1e+06   6.62e+06     -0.771      0.441   -1.81e+07    7.87e+06
school[T.BMGT]                         -1.387e+07   7.29e+06     -1.903      0.057   -2.82e+07    4.14e+05
school[T.BSOS]                          1.591e+05   6.81e+06      0.023      0.981   -1.32e+07    1.35e+07
school[T.CMNS]                          -1.39e+06   6.33e+06     -0.219      0.826   -1.38e+07     1.1e+07
school[T.DIT]                          -7.524e+07   1.11e+09     -0.068      0.946   -2.26e+09    2.11e+09
school[T.EDUC]                          1.453e+07   1.55e+07      0.935      0.350   -1.59e+07     4.5e+07
school[T.ENGR]                         -1.389e+07   7.23e+06     -1.921      0.055   -2.81e+07    2.83e+05
school[T.EXST]                         -1.788e+10   1.56e+10     -1.144      0.253   -4.85e+10    1.28e+10
school[T.GRAD]                           262.3035    872.643      0.301      0.764   -1448.183    1972.790
school[T.INFO]                         -1.025e+07   1.84e+07     -0.558      0.577   -4.63e+07    2.58e+07
school[T.IT]                           -2.492e+09   3.73e+10     -0.067      0.947   -7.55e+10    7.06e+10
school[T.JOUR]                         -1.375e+07   1.22e+07     -1.129      0.259   -3.76e+07    1.01e+07
school[T.LIBR]                         -3.139e+10   1.02e+11     -0.306      0.759   -2.32e+11    1.69e+11
school[T.PLCY]                         -7.291e+05   3.86e+07     -0.019      0.985   -7.64e+07    7.49e+07
school[T.PRES]                         -2.828e+08   1.67e+08     -1.697      0.090    -6.1e+08    4.39e+07
school[T.PUAF]                           480.6775   1561.926      0.308      0.758   -2580.888    3542.243
school[T.SPHL]                         -1.763e+06   9.31e+06     -0.189      0.850      -2e+07    1.65e+07
school[T.SVPAAP]                         2.46e+07   2.53e+07      0.972      0.331    -2.5e+07    7.42e+07
school[T.UGST]                         -2.556e+07   1.82e+07     -1.406      0.160   -6.12e+07    1.01e+07
school[T.USG]                          -2.264e+09   2.95e+09     -0.768      0.442   -8.04e+09    3.51e+09
school[T.VPA]                           -760.1330   2497.981     -0.304      0.761   -5656.482    4136.216
school[T.VPAA]                          -111.2615    354.374     -0.314      0.754    -805.877     583.354
school[T.VPAF]                         -3.257e+09   1.39e+10     -0.234      0.815   -3.06e+10    2.41e+10
school[T.VPR]                           1.619e+09   4.32e+09      0.375      0.708   -6.85e+09    1.01e+10
school[T.VPSA]                         -1.979e+07    3.5e+07     -0.566      0.571   -8.83e+07    4.87e+07
school[T.VPUR]                          -887.9146   2886.380     -0.308      0.758   -6545.573    4769.744
rating                                  3.742e+05   1.54e+06      0.243      0.808   -2.64e+06    3.39e+06
rating:school[T.ARCH]                  -4.064e+06   5.99e+06     -0.679      0.497   -1.58e+07    7.67e+06
rating:school[T.ARHU]                   7.987e+05   1.69e+06      0.473      0.636   -2.51e+06    4.11e+06
rating:school[T.BMGT]                   3.881e+05   1.87e+06      0.208      0.835   -3.27e+06    4.05e+06
rating:school[T.BSOS]                  -8.178e+05   1.73e+06     -0.473      0.637   -4.21e+06    2.57e+06
rating:school[T.CMNS]                  -2.115e+05   1.61e+06     -0.131      0.895   -3.37e+06    2.94e+06
rating:school[T.DIT]                    1.077e+07   2.23e+08      0.048      0.961   -4.26e+08    4.47e+08
rating:school[T.EDUC]                   -2.59e+06   4.21e+06     -0.615      0.538   -1.08e+07    5.66e+06
rating:school[T.ENGR]                   3.808e+06   1.84e+06      2.072      0.038    2.05e+05    7.41e+06
rating:school[T.EXST]                   3.549e+09   3.11e+09      1.140      0.254   -2.56e+09    9.65e+09
rating:school[T.GRAD]                    234.5135    766.731      0.306      0.760   -1268.373    1737.400
rating:school[T.INFO]                   6.638e+05   4.41e+06      0.150      0.880   -7.99e+06    9.31e+06
rating:school[T.IT]                     6.851e+08   1.13e+10      0.061      0.951   -2.14e+10    2.27e+10
rating:school[T.JOUR]                   4.001e+06   3.14e+06      1.275      0.202   -2.15e+06    1.02e+07
rating:school[T.LIBR]                   9.165e+09   2.99e+10      0.307      0.759   -4.93e+10    6.77e+10
rating:school[T.PLCY]                   9.269e+06   9.84e+06      0.942      0.346      -1e+07    2.86e+07
rating:school[T.PRES]                   7.003e+07   3.43e+07      2.042      0.041     2.8e+06    1.37e+08
rating:school[T.PUAF]                     56.6003    176.195      0.321      0.748    -288.763     401.963
rating:school[T.SPHL]                  -1.721e+06   2.42e+06     -0.711      0.477   -6.47e+06    3.02e+06
rating:school[T.SVPAAP]                  -5.7e+06   6.06e+06     -0.940      0.347   -1.76e+07    6.19e+06
rating:school[T.UGST]                   4.154e+06   4.47e+06      0.929      0.353   -4.61e+06    1.29e+07
rating:school[T.USG]                   -1.132e+10   1.47e+10     -0.768      0.442   -4.02e+10    1.76e+10
rating:school[T.VPA]                     -92.7760    314.108     -0.295      0.768    -708.466     522.914
rating:school[T.VPAA]                    -78.4276    258.447     -0.303      0.762    -585.016     428.161
rating:school[T.VPAF]                   6.486e+08   2.79e+09      0.232      0.816   -4.82e+09    6.12e+09
rating:school[T.VPR]                   -5.665e+08   1.44e+09     -0.393      0.694   -3.39e+09    2.26e+09
rating:school[T.VPSA]                   2.145e+06   7.73e+06      0.277      0.781    -1.3e+07    1.73e+07
rating:school[T.VPUR]                     43.7590    136.082      0.322      0.748    -222.978     310.496
sentiment                              -7.718e+06   6.16e+06     -1.252      0.210   -1.98e+07    4.36e+06
sentiment:school[T.ARCH]               -1.295e+07   2.39e+07     -0.541      0.588   -5.99e+07     3.4e+07
sentiment:school[T.ARHU]                6.359e+06   6.72e+06      0.946      0.344   -6.82e+06    1.95e+07
sentiment:school[T.BMGT]                -4.29e+06    7.4e+06     -0.579      0.562   -1.88e+07    1.02e+07
sentiment:school[T.BSOS]                6.872e+06   6.91e+06      0.994      0.320   -6.68e+06    2.04e+07
sentiment:school[T.CMNS]                7.835e+06   6.44e+06      1.217      0.224   -4.79e+06    2.05e+07
sentiment:school[T.DIT]                -8.792e+07   1.12e+09     -0.079      0.937   -2.28e+09     2.1e+09
sentiment:school[T.EDUC]                1.808e+07   1.59e+07      1.138      0.255   -1.31e+07    4.92e+07
sentiment:school[T.ENGR]                4.598e+06   7.36e+06      0.625      0.532   -9.82e+06     1.9e+07
sentiment:school[T.EXST]               -1.794e+10   1.56e+10     -1.149      0.250   -4.85e+10    1.27e+10
sentiment:school[T.GRAD]                  -5.8503     57.069     -0.103      0.918    -117.714     106.013
sentiment:school[T.INFO]                1.923e+07   1.86e+07      1.035      0.301   -1.72e+07    5.57e+07
sentiment:school[T.IT]                  -2.47e+09   3.73e+10     -0.066      0.947   -7.57e+10    7.07e+10
sentiment:school[T.JOUR]                2.431e+07   1.24e+07      1.953      0.051   -9.08e+04    4.87e+07
sentiment:school[T.LIBR]                2.937e+10   9.58e+10      0.306      0.759   -1.58e+11    2.17e+11
sentiment:school[T.PLCY]                8.612e+05   3.89e+07      0.022      0.982   -7.54e+07    7.71e+07
sentiment:school[T.PRES]               -1.026e+08   1.67e+08     -0.615      0.539   -4.29e+08    2.24e+08
sentiment:school[T.PUAF]                  21.2635     31.234      0.681      0.496     -39.960      82.487
sentiment:school[T.SPHL]                3.774e+06   9.39e+06      0.402      0.688   -1.46e+07    2.22e+07
sentiment:school[T.SVPAAP]              3.122e+07   2.64e+07      1.184      0.236   -2.05e+07    8.29e+07
sentiment:school[T.UGST]               -5.103e+06   1.83e+07     -0.278      0.781    -4.1e+07    3.08e+07
sentiment:school[T.USG]                 2.264e+09   2.95e+09      0.768      0.442   -3.51e+09    8.04e+09
sentiment:school[T.VPA]                    5.4589     21.019      0.260      0.795     -35.741      46.659
sentiment:school[T.VPAA]                   2.8087      7.391      0.380      0.704     -11.679      17.296
sentiment:school[T.VPAF]               -3.262e+09   1.39e+10     -0.234      0.815   -3.06e+10    2.41e+10
sentiment:school[T.VPR]                 -1.88e+09   4.38e+09     -0.429      0.668   -1.05e+10     6.7e+09
sentiment:school[T.VPSA]               -3.063e+07   3.53e+07     -0.869      0.385   -9.97e+07    3.85e+07
sentiment:school[T.VPUR]                  -5.6744      8.584     -0.661      0.509     -22.500      11.151
rating:sentiment                        1.386e+06   1.56e+06      0.887      0.375   -1.68e+06    4.45e+06
rating:sentiment:school[T.ARCH]         4.635e+06   6.01e+06      0.771      0.441   -7.15e+06    1.64e+07
rating:sentiment:school[T.ARHU]        -1.413e+06   1.71e+06     -0.826      0.409   -4.77e+06    1.94e+06
rating:sentiment:school[T.BMGT]         2.073e+06   1.89e+06      1.098      0.272   -1.63e+06    5.77e+06
rating:sentiment:school[T.BSOS]        -7.294e+05   1.75e+06     -0.416      0.677   -4.16e+06    2.71e+06
rating:sentiment:school[T.CMNS]        -1.616e+06   1.63e+06     -0.990      0.322   -4.82e+06    1.58e+06
rating:sentiment:school[T.DIT]          1.902e+07   2.24e+08      0.085      0.932    -4.2e+08    4.58e+08
rating:sentiment:school[T.EDUC]        -3.796e+06   4.26e+06     -0.891      0.373   -1.21e+07    4.56e+06
rating:sentiment:school[T.ENGR]        -1.026e+06   1.86e+06     -0.550      0.582   -4.68e+06    2.63e+06
rating:sentiment:school[T.EXST]         3.614e+09   3.13e+09      1.153      0.249   -2.53e+09    9.76e+09
rating:sentiment:school[T.GRAD]           -5.8632     21.275     -0.276      0.783     -47.565      35.839
rating:sentiment:school[T.INFO]        -4.461e+06   4.42e+06     -1.010      0.312   -1.31e+07     4.2e+06
rating:sentiment:school[T.IT]           6.851e+08   1.13e+10      0.061      0.952   -2.14e+10    2.28e+10
rating:sentiment:school[T.JOUR]        -6.554e+06   3.18e+06     -2.058      0.040   -1.28e+07   -3.12e+05
rating:sentiment:school[T.LIBR]        -8.765e+09   2.85e+10     -0.307      0.759   -6.47e+10    4.72e+10
rating:sentiment:school[T.PLCY]          -7.2e+06   9.74e+06     -0.740      0.460   -2.63e+07    1.19e+07
rating:sentiment:school[T.PRES]        -7.202e+06   3.43e+07     -0.210      0.834   -7.44e+07       6e+07
rating:sentiment:school[T.PUAF]           -0.0674      0.172     -0.392      0.695      -0.405       0.270
rating:sentiment:school[T.SPHL]         9153.5626   2.42e+06      0.004      0.997   -4.74e+06    4.76e+06
rating:sentiment:school[T.SVPAAP]      -4.292e+06   6.27e+06     -0.684      0.494   -1.66e+07       8e+06
rating:sentiment:school[T.UGST]         6.711e+05   4.49e+06      0.149      0.881   -8.14e+06    9.48e+06
rating:sentiment:school[T.USG]          1.132e+10   1.47e+10      0.768      0.442   -1.76e+10    4.02e+10
rating:sentiment:school[T.VPA]            -0.1162      0.384     -0.303      0.762      -0.868       0.636
rating:sentiment:school[T.VPAA]           -0.0016      0.011     -0.140      0.888      -0.024       0.021
rating:sentiment:school[T.VPAF]         6.544e+08   2.79e+09      0.235      0.814   -4.81e+09    6.11e+09
rating:sentiment:school[T.VPR]          6.155e+08   1.45e+09      0.424      0.672   -2.23e+09    3.46e+09
rating:sentiment:school[T.VPSA]          5.03e+06    7.8e+06      0.645      0.519   -1.03e+07    2.03e+07
rating:sentiment:school[T.VPUR]           -0.0024      0.177     -0.013      0.989      -0.348       0.344
year                                     844.2779   3000.697      0.281      0.778   -5037.456    6726.011
year:school[T.ARCH]                    -2151.4849   1.18e+04     -0.182      0.855   -2.53e+04     2.1e+04
year:school[T.ARHU]                     2504.3084   3278.295      0.764      0.445   -3921.551    8930.168
year:school[T.BMGT]                     6880.3313   3611.573      1.905      0.057    -198.794     1.4e+04
year:school[T.BSOS]                      -87.8066   3372.337     -0.026      0.979   -6698.000    6522.387
year:school[T.CMNS]                      675.7271   3137.588      0.215      0.829   -5474.329    6825.783
year:school[T.DIT]                      3.721e+04   5.51e+05      0.068      0.946   -1.04e+06    1.12e+06
year:school[T.EDUC]                    -7210.6741   7701.452     -0.936      0.349   -2.23e+04    7885.114
year:school[T.ENGR]                     6896.3813   3582.354      1.925      0.054    -125.472    1.39e+04
year:school[T.EXST]                     8.882e+06   7.76e+06      1.144      0.253   -6.34e+06    2.41e+07
year:school[T.GRAD]                       -1.8856      4.178     -0.451      0.652     -10.075       6.304
year:school[T.INFO]                     5056.2179   9098.445      0.556      0.578   -1.28e+04    2.29e+04
year:school[T.IT]                       1.237e+06   1.85e+07      0.067      0.947    -3.5e+07    3.75e+07
year:school[T.JOUR]                     6808.7787   6033.950      1.128      0.259   -5018.501    1.86e+04
year:school[T.LIBR]                     1.527e+07   4.98e+07      0.307      0.759   -8.24e+07    1.13e+08
year:school[T.PLCY]                      346.9105   1.91e+04      0.018      0.986   -3.71e+04    3.78e+04
year:school[T.PRES]                     1.399e+05   8.25e+04      1.697      0.090   -2.17e+04    3.02e+05
year:school[T.PUAF]                      -37.5443     38.995     -0.963      0.336    -113.979      38.890
year:school[T.SPHL]                      857.9856   4609.919      0.186      0.852   -8178.020    9893.991
year:school[T.SVPAAP]                  -1.218e+04   1.25e+04     -0.972      0.331   -3.67e+04    1.24e+04
year:school[T.UGST]                     1.262e+04   9001.457      1.402      0.161   -5020.831    3.03e+04
year:school[T.USG]                       1.12e+06   1.46e+06      0.768      0.443   -1.74e+06    3.98e+06
year:school[T.VPA]                      -133.6173    888.135     -0.150      0.880   -1874.471    1607.237
year:school[T.VPAA]                       15.0971     30.792      0.490      0.624     -45.258      75.452
year:school[T.VPAF]                     1.613e+06    6.9e+06      0.234      0.815   -1.19e+07    1.51e+07
year:school[T.VPR]                     -8.007e+05   2.14e+06     -0.375      0.708   -4.99e+06    3.39e+06
year:school[T.VPSA]                     9794.6141   1.73e+04      0.566      0.572   -2.41e+04    4.37e+04
year:school[T.VPUR]                      -45.7452    125.694     -0.364      0.716    -292.120     200.630
rating:year                             -188.4963    762.540     -0.247      0.805   -1683.168    1306.176
rating:year:school[T.ARCH]              2015.2534   2966.200      0.679      0.497   -3798.862    7829.369
rating:year:school[T.ARHU]              -393.0529    836.999     -0.470      0.639   -2033.673    1247.567
rating:year:school[T.BMGT]              -188.7083    924.301     -0.204      0.838   -2000.452    1623.036
rating:year:school[T.BSOS]               406.7899    857.459      0.474      0.635   -1273.935    2087.515
rating:year:school[T.CMNS]               109.2402    797.346      0.137      0.891   -1453.656    1672.137
rating:year:school[T.DIT]              -5322.8543    1.1e+05     -0.048      0.961   -2.21e+05    2.11e+05
rating:year:school[T.EDUC]              1285.3286   2086.327      0.616      0.538   -2804.128    5374.785
rating:year:school[T.ENGR]             -1885.6324    910.582     -2.071      0.038   -3670.485    -100.780
rating:year:school[T.EXST]             -1.763e+06   1.55e+06     -1.140      0.254    -4.8e+06    1.27e+06
rating:year:school[T.GRAD]                -5.2030     13.597     -0.383      0.702     -31.854      21.448
rating:year:school[T.INFO]              -325.4472   2184.455     -0.149      0.882   -4607.247    3956.353
rating:year:school[T.IT]               -3.401e+05   5.59e+06     -0.061      0.951   -1.13e+07    1.06e+07
rating:year:school[T.JOUR]             -1981.5227   1554.505     -1.275      0.202   -5028.543    1065.498
rating:year:school[T.LIBR]             -4.484e+06   1.46e+07     -0.307      0.759   -3.31e+07    2.41e+07
rating:year:school[T.PLCY]             -4580.9211   4873.209     -0.940      0.347   -1.41e+04    4971.166
rating:year:school[T.PRES]             -3.465e+04    1.7e+04     -2.042      0.041   -6.79e+04   -1387.273
rating:year:school[T.PUAF]                11.6218     11.549      1.006      0.314     -11.015      34.258
rating:year:school[T.SPHL]               854.2979   1199.301      0.712      0.476   -1496.478    3205.074
rating:year:school[T.SVPAAP]            2821.7058   3002.397      0.940      0.347   -3063.359    8706.771
rating:year:school[T.UGST]             -2052.1365   2214.397     -0.927      0.354   -6392.625    2288.352
rating:year:school[T.USG]                 5.6e+06    7.3e+06      0.768      0.443    -8.7e+06    1.99e+07
rating:year:school[T.VPA]                 65.9324    417.969      0.158      0.875    -753.338     885.203
rating:year:school[T.VPAA]                 6.4811     33.095      0.196      0.845     -58.389      71.351
rating:year:school[T.VPAF]             -3.213e+05   1.38e+06     -0.232      0.816   -3.03e+06    2.39e+06
rating:year:school[T.VPR]               2.802e+05   7.13e+05      0.393      0.694   -1.12e+06    1.68e+06
rating:year:school[T.VPSA]             -1064.7484   3827.557     -0.278      0.781   -8567.229    6437.732
rating:year:school[T.VPUR]               -57.8329    334.576     -0.173      0.863    -713.643     597.977
sentiment:year                          3832.4920   3053.837      1.255      0.210   -2153.402    9818.386
sentiment:year:school[T.ARCH]           6408.2472   1.19e+04      0.540      0.589   -1.68e+04    2.97e+04
sentiment:year:school[T.ARHU]          -3157.9669   3331.082     -0.948      0.343   -9687.295    3371.361
sentiment:year:school[T.BMGT]           2112.8755   3668.651      0.576      0.565   -5078.130    9303.881
sentiment:year:school[T.BSOS]          -3412.9629   3425.704     -0.996      0.319   -1.01e+04    3301.837
sentiment:year:school[T.CMNS]          -3891.2045   3191.181     -1.219      0.223   -1.01e+04    2363.901
sentiment:year:school[T.DIT]            4.348e+04   5.53e+05      0.079      0.937   -1.04e+06    1.13e+06
sentiment:year:school[T.EDUC]          -8961.8104   7872.301     -1.138      0.255   -2.44e+04    6468.863
sentiment:year:school[T.ENGR]          -2281.2498   3645.376     -0.626      0.531   -9426.633    4864.133
sentiment:year:school[T.EXST]            8.91e+06   7.75e+06      1.149      0.250   -6.29e+06    2.41e+07
sentiment:year:school[T.GRAD]              9.2325     27.594      0.335      0.738     -44.855      63.320
sentiment:year:school[T.INFO]          -9536.4298   9200.686     -1.036      0.300   -2.76e+04    8498.040
sentiment:year:school[T.IT]             1.226e+06   1.85e+07      0.066      0.947   -3.51e+07    3.76e+07
sentiment:year:school[T.JOUR]          -1.205e+04   6165.452     -1.955      0.051   -2.41e+04      32.767
sentiment:year:school[T.LIBR]          -1.487e+07   4.85e+07     -0.306      0.759    -1.1e+08    8.02e+07
sentiment:year:school[T.PLCY]           -449.7502   1.93e+04     -0.023      0.981   -3.82e+04    3.73e+04
sentiment:year:school[T.PRES]           5.075e+04   8.25e+04      0.615      0.538   -1.11e+05    2.12e+05
sentiment:year:school[T.PUAF]             24.7884     37.502      0.661      0.509     -48.719      98.296
sentiment:year:school[T.SPHL]          -1878.0146   4650.369     -0.404      0.686    -1.1e+04    7237.278
sentiment:year:school[T.SVPAAP]        -1.547e+04   1.31e+04     -1.185      0.236   -4.11e+04    1.01e+04
sentiment:year:school[T.UGST]           2525.6234   9079.922      0.278      0.781   -1.53e+04    2.03e+04
sentiment:year:school[T.USG]            -1.12e+06   1.46e+06     -0.767      0.443   -3.98e+06    1.74e+06
sentiment:year:school[T.VPA]             158.6193    719.610      0.220      0.826   -1251.904    1569.142
sentiment:year:school[T.VPAA]             10.4482     12.005      0.870      0.384     -13.083      33.979
sentiment:year:school[T.VPAF]           1.616e+06   6.91e+06      0.234      0.815   -1.19e+07    1.52e+07
sentiment:year:school[T.VPR]            9.295e+05   2.16e+06      0.429      0.668   -3.31e+06    5.17e+06
sentiment:year:school[T.VPSA]           1.518e+04   1.75e+04      0.869      0.385   -1.91e+04    4.94e+04
sentiment:year:school[T.VPUR]             22.0298    135.518      0.163      0.871    -243.602     287.662
rating:sentiment:year                   -688.1618    774.219     -0.889      0.374   -2205.725     829.402
rating:sentiment:year:school[T.ARCH]   -2293.7222   2978.574     -0.770      0.441   -8132.092    3544.648
rating:sentiment:year:school[T.ARHU]     701.0567    847.883      0.827      0.408    -960.897    2363.010
rating:sentiment:year:school[T.BMGT]   -1024.9489    935.191     -1.096      0.273   -2858.038     808.141
rating:sentiment:year:school[T.BSOS]     362.6377    868.242      0.418      0.676   -1339.223    2064.498
rating:sentiment:year:school[T.CMNS]     802.1658    809.009      0.992      0.321    -783.591    2387.923
rating:sentiment:year:school[T.DIT]    -9407.6274   1.11e+05     -0.085      0.932   -2.26e+05    2.08e+05
rating:sentiment:year:school[T.EDUC]    1882.3208   2111.954      0.891      0.373   -2257.367    6022.008
rating:sentiment:year:school[T.ENGR]     508.7386    923.571      0.551      0.582   -1301.574    2319.051
rating:sentiment:year:school[T.EXST]   -1.795e+06   1.56e+06     -1.153      0.249   -4.85e+06    1.26e+06
rating:sentiment:year:school[T.GRAD]      -7.9048     28.007     -0.282      0.778     -62.802      46.992
rating:sentiment:year:school[T.INFO]    2211.9624   2186.342      1.012      0.312   -2073.536    6497.460
rating:sentiment:year:school[T.IT]     -3.401e+05    5.6e+06     -0.061      0.952   -1.13e+07    1.06e+07
rating:sentiment:year:school[T.JOUR]    3249.4439   1577.484      2.060      0.039     157.382    6341.506
rating:sentiment:year:school[T.LIBR]    4.405e+06   1.43e+07      0.307      0.759   -2.37e+07    3.25e+07
rating:sentiment:year:school[T.PLCY]    3566.6241   4819.608      0.740      0.459   -5880.398     1.3e+04
rating:sentiment:year:school[T.PRES]    3558.7343    1.7e+04      0.210      0.834   -2.97e+04    3.68e+04
rating:sentiment:year:school[T.PUAF]     -11.8609     10.966     -1.082      0.279     -33.355       9.633
rating:sentiment:year:school[T.SPHL]      -3.1089   1199.397     -0.003      0.998   -2354.075    2347.857
rating:sentiment:year:school[T.SVPAAP]  2126.8657   3104.023      0.685      0.493   -3957.398    8211.130
rating:sentiment:year:school[T.UGST]    -333.1963   2225.194     -0.150      0.881   -4694.850    4028.457
rating:sentiment:year:school[T.USG]    -5.601e+06    7.3e+06     -0.767      0.443   -1.99e+07     8.7e+06
rating:sentiment:year:school[T.VPA]      -73.4847    385.215     -0.191      0.849    -828.553     681.584
rating:sentiment:year:school[T.VPAA]      18.3012     39.802      0.460      0.646     -59.716      96.319
rating:sentiment:year:school[T.VPAF]   -3.241e+05   1.38e+06     -0.235      0.814   -3.03e+06    2.38e+06
rating:sentiment:year:school[T.VPR]    -3.044e+05   7.18e+05     -0.424      0.672   -1.71e+06     1.1e+06
rating:sentiment:year:school[T.VPSA]   -2493.1216   3862.732     -0.645      0.519   -1.01e+04    5078.306
rating:sentiment:year:school[T.VPUR]     -74.2660    340.336     -0.218      0.827    -741.365     592.833
==============================================================================
Omnibus:                     2654.981   Durbin-Watson:                   0.461
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6882.544
Skew:                           0.962   Prob(JB):                         0.00
Kurtosis:                       5.681   Cond. No.                     1.25e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.19e-20. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

This R-squared value, although not terribly close to 1, is much higher than those of our previous models. However, the p-values for nearly all of our coefficients are extremely high, and the eigenvalue warning above suggests severe multicollinearity, so the individual coefficient estimates are not reliable.
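The multicollinearity flagged in note [2] can be quantified with variance inflation factors (VIFs), which statsmodels provides directly. The sketch below uses a small hypothetical design matrix with the same column names as our model (`rating`, `sentiment`, `year`); it is an illustration of the diagnostic, not a rerun of our actual data.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical toy data standing in for the numeric predictors in our model.
X = pd.DataFrame({
    "rating":    [4.2, 3.1, 2.8, 4.9, 3.7, 2.2],
    "sentiment": [0.8, 0.1, -0.2, 0.9, 0.4, -0.5],
    "year":      [2014, 2015, 2016, 2017, 2018, 2019],
})
X = X.assign(const=1.0)  # VIF calculations expect an intercept column

# VIF = 1 / (1 - R^2) from regressing each predictor on the others;
# values well above ~5-10 signal problematic multicollinearity.
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)
```

A model whose interaction terms are built from highly correlated predictors (as `rating` and `sentiment` likely are) will show inflated VIFs, which is consistent with the near-singular design matrix reported above.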

In [148]:
reviews_salaries_df.plot(kind="scatter", x="salary", y="sentiment")

Insights and Conclusions¶

Unfortunately, after performing our analysis we were unable to find any statistically significant evidence to support our original hypothesis. We believe several confounding factors may have weakened the correlation between a professor's reviews and how much they make. For instance:

  • Professors who also conduct scholarly research might have different salary data points compared to those who don't
    • This is important because we summed salaries to get a total yearly salary for each professor
    • Future improvements could involve only considering the compensation provided for teaching courses
  • Reviews aren't necessarily the best metric to measure teaching quality in general
    • They're written by students, and could be directly influenced by personal opinions and/or biases regarding the professor, as well as any prior rumours or groupthink about these classes
    • Reviews could also be affected more by the course curriculum for certain data points, rather than the professor and their teaching methods
    • Student reviews tend toward extremes and amplify vocal minorities, which skews the resulting sentiments

Future work on this subject should take student reviews with a grain of salt due to the issues mentioned above. Instead, school-backed metrics, such as course evaluations and gradebook data, should be taken into consideration, as these data sources are generally less self-selective and biased.